Data Analysis

Comprehensive Data Cleaning & Exploratory Analysis of Job Market Trends

Authors
Affiliation

Connor Coulter

Boston University

Wei Wang

Boston University

Balqis Bevi Abdul Hannan Kanaga

Boston University

1 Import Data

import pandas as pd, os

csv_path = "data/lightcast_job_postings.csv"
df = pd.read_csv(csv_path, low_memory=False)

print("Rows, Cols:", df.shape)
print(list(df.columns)[:40])
df.head(5)
Rows, Cols: (72498, 131)
['ID', 'LAST_UPDATED_DATE', 'LAST_UPDATED_TIMESTAMP', 'DUPLICATES', 'POSTED', 'EXPIRED', 'DURATION', 'SOURCE_TYPES', 'SOURCES', 'URL', 'ACTIVE_URLS', 'ACTIVE_SOURCES_INFO', 'TITLE_RAW', 'BODY', 'MODELED_EXPIRED', 'MODELED_DURATION', 'COMPANY', 'COMPANY_NAME', 'COMPANY_RAW', 'COMPANY_IS_STAFFING', 'EDUCATION_LEVELS', 'EDUCATION_LEVELS_NAME', 'MIN_EDULEVELS', 'MIN_EDULEVELS_NAME', 'MAX_EDULEVELS', 'MAX_EDULEVELS_NAME', 'EMPLOYMENT_TYPE', 'EMPLOYMENT_TYPE_NAME', 'MIN_YEARS_EXPERIENCE', 'MAX_YEARS_EXPERIENCE', 'IS_INTERNSHIP', 'SALARY', 'REMOTE_TYPE', 'REMOTE_TYPE_NAME', 'ORIGINAL_PAY_PERIOD', 'SALARY_TO', 'SALARY_FROM', 'LOCATION', 'CITY', 'CITY_NAME']
ID LAST_UPDATED_DATE LAST_UPDATED_TIMESTAMP DUPLICATES POSTED EXPIRED DURATION SOURCE_TYPES SOURCES URL ... NAICS_2022_2 NAICS_2022_2_NAME NAICS_2022_3 NAICS_2022_3_NAME NAICS_2022_4 NAICS_2022_4_NAME NAICS_2022_5 NAICS_2022_5_NAME NAICS_2022_6 NAICS_2022_6_NAME
0 1f57d95acf4dc67ed2819eb12f049f6a5c11782c 9/6/2024 2024-09-06 20:32:57.352 Z 0.0 6/2/2024 6/8/2024 6.0 [\n "Company"\n] [\n "brassring.com"\n] [\n "https://sjobs.brassring.com/TGnewUI/Sear... ... 44.0 Retail Trade 441.0 Motor Vehicle and Parts Dealers 4413.0 Automotive Parts, Accessories, and Tire Retailers 44133.0 Automotive Parts and Accessories Retailers 441330.0 Automotive Parts and Accessories Retailers
1 0cb072af26757b6c4ea9464472a50a443af681ac 8/2/2024 2024-08-02 17:08:58.838 Z 0.0 6/2/2024 8/1/2024 NaN [\n "Job Board"\n] [\n "maine.gov"\n] [\n "https://joblink.maine.gov/jobs/1085740"\n] ... 56.0 Administrative and Support and Waste Managemen... 561.0 Administrative and Support Services 5613.0 Employment Services 56132.0 Temporary Help Services 561320.0 Temporary Help Services
2 85318b12b3331fa490d32ad014379df01855c557 9/6/2024 2024-09-06 20:32:57.352 Z 1.0 6/2/2024 7/7/2024 35.0 [\n "Job Board"\n] [\n "dejobs.org"\n] [\n "https://dejobs.org/dallas-tx/data-analys... ... 52.0 Finance and Insurance 524.0 Insurance Carriers and Related Activities 5242.0 Agencies, Brokerages, and Other Insurance Rela... 52429.0 Other Insurance Related Activities 524291.0 Claims Adjusting
3 1b5c3941e54a1889ef4f8ae55b401a550708a310 9/6/2024 2024-09-06 20:32:57.352 Z 1.0 6/2/2024 7/20/2024 48.0 [\n "Job Board"\n] [\n "disabledperson.com",\n "dejobs.org"\n] [\n "https://www.disabledperson.com/jobs/5948... ... 52.0 Finance and Insurance 522.0 Credit Intermediation and Related Activities 5221.0 Depository Credit Intermediation 52211.0 Commercial Banking 522110.0 Commercial Banking
4 cb5ca25f02bdf25c13edfede7931508bfd9e858f 6/19/2024 2024-06-19 07:00:00.000 Z 0.0 6/2/2024 6/17/2024 15.0 [\n "FreeJobBoard"\n] [\n "craigslist.org"\n] [\n "https://modesto.craigslist.org/sls/77475... ... 99.0 Unclassified Industry 999.0 Unclassified Industry 9999.0 Unclassified Industry 99999.0 Unclassified Industry 999999.0 Unclassified Industry

5 rows × 131 columns

2 Data Cleaning & Preprocessing

2.1 Dropping Unnecessary Columns

columns_to_drop = [
    "ID","LAST_UPDATED_TIMESTAMP","DUPLICATES","ACTIVE_URLS","ACTIVE_SOURCES_INFO",
    "TITLE_RAW","BODY","COMPANY_RAW",
    "NAICS2","NAICS2_NAME","NAICS3","NAICS3_NAME","NAICS4","NAICS4_NAME",
    "NAICS5","NAICS5_NAME","NAICS6","NAICS6_NAME",
    "NAICS_2022_2","NAICS_2022_2_NAME","NAICS_2022_3","NAICS_2022_3_NAME",
    "NAICS_2022_4","NAICS_2022_4_NAME","NAICS_2022_5","NAICS_2022_5_NAME",
    "SOC_2","SOC_2_NAME","SOC_3","SOC_3_NAME","SOC_5","SOC_5_NAME",
    "CIP2","CIP2_NAME","CIP4","CIP4_NAME","CIP6","CIP6_NAME",
    "LOT_CAREER_AREA","LOT_CAREER_AREA_NAME","LOT_OCCUPATION","LOT_OCCUPATION_NAME",
    "LOT_SPECIALIZED_OCCUPATION","LOT_SPECIALIZED_OCCUPATION_NAME",
    "LOT_OCCUPATION_GROUP","LOT_OCCUPATION_GROUP_NAME",
    "LOT_V6_SPECIALIZED_OCCUPATION","LOT_V6_SPECIALIZED_OCCUPATION_NAME",
    "LOT_V6_OCCUPATION","LOT_V6_OCCUPATION_NAME","LOT_V6_OCCUPATION_GROUP",
    "LOT_V6_OCCUPATION_GROUP_NAME","LOT_V6_CAREER_AREA","LOT_V6_CAREER_AREA_NAME",
    "ONET","ONET_NAME","ONET_2019","ONET_2019_NAME"
]
drop_existing = [c for c in columns_to_drop if c in df.columns]
df.drop(columns=drop_existing, inplace=True)
print("Remaining columns (first 30):", list(df.columns)[:30])
Remaining columns (first 30): ['LAST_UPDATED_DATE', 'POSTED', 'EXPIRED', 'DURATION', 'SOURCE_TYPES', 'SOURCES', 'URL', 'MODELED_EXPIRED', 'MODELED_DURATION', 'COMPANY', 'COMPANY_NAME', 'COMPANY_IS_STAFFING', 'EDUCATION_LEVELS', 'EDUCATION_LEVELS_NAME', 'MIN_EDULEVELS', 'MIN_EDULEVELS_NAME', 'MAX_EDULEVELS', 'MAX_EDULEVELS_NAME', 'EMPLOYMENT_TYPE', 'EMPLOYMENT_TYPE_NAME', 'MIN_YEARS_EXPERIENCE', 'MAX_YEARS_EXPERIENCE', 'IS_INTERNSHIP', 'SALARY', 'REMOTE_TYPE', 'REMOTE_TYPE_NAME', 'ORIGINAL_PAY_PERIOD', 'SALARY_TO', 'SALARY_FROM', 'LOCATION']

2.2 Handling Missing Values

try:
    import missingno as msno, matplotlib.pyplot as plt
    msno.heatmap(df)
    plt.title("Missing Values Heatmap")
    plt.show()
except Exception as e:
    print("missingno heatmap skipped:", e)

df.dropna(thresh=len(df) * 0.5, axis=1, inplace=True)

if "SALARY" in df.columns:
    df["SALARY"] = pd.to_numeric(df["SALARY"], errors="coerce")
    df["SALARY"].fillna(df["SALARY"].median(), inplace=True)

for col in df.select_dtypes(include="object").columns:
    df[col].fillna("Unknown", inplace=True)

/var/folders/g7/sfc5tly50013vn_cy1c842180000gn/T/ipykernel_7902/1704429780.py:16: FutureWarning:

A value is trying to be set on a copy of a DataFrame or Series through chained assignment using an inplace method.
The behavior will change in pandas 3.0. This inplace method will never work because the intermediate object on which we are setting values always behaves as a copy.

For example, when doing 'df[col].method(value, inplace=True)', try using 'df.method({col: value}, inplace=True)' or df[col] = df[col].method(value) instead, to perform the operation inplace on the original object.


2.3 Removing Duplicates

subset_cols = [c for c in ["TITLE","COMPANY_NAME","LOCATION","POSTED"] if c in df.columns]
if subset_cols:
    before = len(df)
    df.drop_duplicates(subset=subset_cols, keep="first", inplace=True)
    print(f"Removed {before - len(df)} duplicates using {subset_cols}")
else:
    print("No standard duplicate keys found; skipping.")
Removed 3300 duplicates using ['TITLE', 'COMPANY_NAME', 'LOCATION', 'POSTED']

3 Exploratory Data Analysis (EDA)

3.1 Job Postings by Industry

import plotly.express as px

industry_col = (
    "NAICS_2022_6_NAME" if "NAICS_2022_6_NAME" in df.columns
    else ("INDUSTRY" if "INDUSTRY" in df.columns else None)
)
if industry_col is None:
    print("No industry column found.")
else:
    counts = (
        df[industry_col]
        .value_counts()
        .head(15)
        .reset_index(name="Job Postings")
        .rename(columns={industry_col: "Industry"})
        .sort_values("Job Postings")
    )
    fig = px.bar(
        counts,
        x="Job Postings",
        y="Industry",
        orientation="h",
        title="Top 15 Industries by Number of Job Postings"
    )
    fig.show()

3.2 Salary Distribution by Industry

import numpy as np, plotly.express as px

industry_col = (
    "NAICS_2022_6_NAME" if "NAICS_2022_6_NAME" in df.columns
    else ("INDUSTRY" if "INDUSTRY" in df.columns else None)
)
salary_candidates = ["SALARY","SALARY_MEDIAN","SALARY_MID","SALARY_ANNUAL","PAY_RATE"]
salary_col = next((c for c in salary_candidates if c in df.columns), None)

if industry_col is None or salary_col is None:
    print("Missing salary or industry column for plot.")
else:
    sdf = df[[industry_col, salary_col]].copy()
    sdf[salary_col] = pd.to_numeric(sdf[salary_col], errors="coerce")
    sdf = sdf.dropna(subset=[salary_col]).query(f"{salary_col} > 0")
    fig2 = px.box(
        sdf,
        x=industry_col,
        y=salary_col,
        title="Salary Distribution by Industry",
        points=False
    )
    fig2.update_layout(xaxis_tickangle=-45)
    fig2.show()
Missing salary or industry column for plot.

3.3 Remote vs. On-Site Jobs

import plotly.express as px

if "REMOTE_TYPE_NAME" in df.columns:
    rc = df["REMOTE_TYPE_NAME"].value_counts().reset_index()
    rc.columns = ["Remote Type","Count"]
    fig3 = px.pie(
        rc,
        names="Remote Type",
        values="Count",
        title="Remote vs. On-Site Job Distribution"
    )
    fig3.show()
else:
    print("No REMOTE_TYPE_NAME column available.")

3.4 EDA: Rationale & Insights

3.4.1 Why these visualizations

  • Job Postings by Industry (bar chart) compares demand across industries to prioritize where to focus a job search.
  • Salary Distribution by Industry (box plot) shows median and spread, robust to skew in salary data.
  • Remote vs. On-Site (pie chart) summarizes work-arrangement mix to set expectations for location flexibility.

3.4.2 Key insights from the graphs

  • Industry demand: A few industries dominate postings; target those first while keeping secondary sectors.
  • Salary patterns: Wide dispersion in some industries implies upside for well-positioned candidates.
  • Work arrangement mix: Higher remote share expands options; lower share requires local strategy.
  • Actionable takeaway: Combine where demand is high, where salary upside exists, and where work arrangement fits your constraints to prioritize applications.